Analyzing a Greedy Approximation of an MDL Summarization

Author

  • Peter Fontana
Abstract

Many OLAP (On-line Analytical Processing) applications produce data cubes that summarize and aggregate the details of data queries. These data cubes are multi-dimensional matrices in which each cell that satisfies a specific property or trait is represented as a 1, called a 1-cell in this report. A cell that does not satisfy that property is represented as a 0, called a 0-cell in this report. To compress the space required to represent such a matrix completely, others have used MDL (Minimum Description Length) Summarization, including MDL Summarization with Holes. While it is NP-hard to compute the optimal MDL Summarization with Holes for a data matrix of 2 or more dimensions (proven by Bu et al. [1]), there exists a greedy algorithm that approximates the MDL Summarization with Holes and is proven to give an answer within a factor of lm ∗ log(M) of the optimal solution (proven by Guha and Tan [3]), where M is a factor dependent on the size of the data matrix. See the Technical Approach section of this report for a definition of lm. However, Guha and Tan [3] mention that this bound has not been proven tight. I studied this question for 2-dimensional matrices where the algorithm can only compress by covering rows and columns (here lm = 2). Currently, I have a proof that the greedy algorithm is a 4-approximation algorithm in this special 2-dimensional case and a constant-factor (2 ∗ (κ − 2))-approximation algorithm in the general case. Furthermore, I have written a program that uses the greedy approximation to compute an MDL Summarization with Holes of an arbitrary n-by-n 2-dimensional matrix of 1's and 0's.
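To make the row-and-column special case concrete, here is a minimal Python sketch of a greedy cover of this kind. It is not the program mentioned in the abstract and not necessarily the algorithm of Guha and Tan [3], whose details the abstract does not give. The cost model is an assumption made for illustration: each chosen row or column costs 1, each 0-cell inside a chosen row or column (a hole) costs 1, and each 1-cell left uncovered costs 1 because it must be listed individually. The names `greedy_summary`, `gain`, `holes`, and `leftovers` are hypothetical.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]
Cell = Tuple[int, int]


def greedy_summary(m: Matrix) -> Tuple[Set[int], Set[int], List[Cell], List[Cell]]:
    """Greedily pick rows/columns to cover 1-cells (assumed cost model, see text)."""
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rows: Set[int] = set()
    cols: Set[int] = set()

    def covered(r: int, c: int) -> bool:
        return r in rows or c in cols

    def gain(kind: str, idx: int) -> int:
        # Reduction in description length if this row/column is added:
        # newly covered 1-cells no longer need listing (+ones), the line
        # itself costs 1, and each newly covered 0-cell becomes a hole.
        if kind == "row":
            line = [(idx, c) for c in range(n_cols)]
        else:
            line = [(r, idx) for r in range(n_rows)]
        fresh = [(r, c) for r, c in line if not covered(r, c)]
        ones = sum(m[r][c] for r, c in fresh)
        zeros = len(fresh) - ones
        return ones - zeros - 1

    while True:
        candidates = [("row", r) for r in range(n_rows) if r not in rows]
        candidates += [("col", c) for c in range(n_cols) if c not in cols]
        if not candidates:
            break
        kind, idx = max(candidates, key=lambda k: gain(*k))
        if gain(kind, idx) <= 0:
            break  # no remaining row or column shortens the description
        (rows if kind == "row" else cols).add(idx)

    all_cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    holes = [(r, c) for r, c in all_cells if covered(r, c) and m[r][c] == 0]
    leftovers = [(r, c) for r, c in all_cells if not covered(r, c) and m[r][c] == 1]
    return rows, cols, holes, leftovers


grid = [
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
rows, cols, holes, leftovers = greedy_summary(grid)
print(rows, cols, holes, leftovers)  # {0} {2} [] [(2, 0), (2, 1)]
```

Under the assumed cost model, the nine 1-cells of `grid` are described with length 4: one row, one column, no holes, and two individually listed 1-cells. A different cost model would change which rows and columns the greedy step prefers.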


Similar resources

MDL Summarization with Holes

Summarization of query results is an important problem for many OLAP applications. The Minimum Description Length principle has been applied in various studies to provide summaries. In this paper, we consider a new approach of applying the MDL principle. We study the problem of finding summaries of the form S H for k-d cubes with tree hierarchies. The S part generalizes the query results, while...


Krimping texts for better summarization

Automated text summarization is aimed at extracting essential information from original text and presenting it in a minimal, often predefined, number of words. In this paper, we introduce a new approach for unsupervised extractive summarization, based on the Minimum Description Length (MDL) principle, using the Krimp dataset compression algorithm (Vreeken et al., 2011). Our approach represents ...


Multi-document Summarization via Budgeted Maximization of Submodular Functions

We treat the text summarization problem as maximizing a submodular function under a budget constraint. We show, both theoretically and empirically, a modified greedy algorithm can efficiently solve the budgeted submodular maximization problem near-optimally, and we derive new approximation bounds in doing so. Experiments on DUC’04 task show that our approach is superior to the best-performing me...


A Study of Global Inference Algorithms in Multi-document Summarization

In this work we study the theoretical and empirical properties of various global inference algorithms for multi-document summarization. We start by defining a general framework and proving that inference in it is NP-hard. We then present three algorithms: The first is a greedy approximate method, the second a dynamic programming approach based on solutions to the knapsack problem, and the third...


Greedy and Relaxed Approximations to Model Selection: A simulation study

The Minimum Description Length (MDL) principle is an important tool for retrieving knowledge from data as it embodies the scientific strife for simplicity in describing the relationship among variables. As MDL and other model selection criteria penalize models on their dimensionality, the estimation problem involves a combinatorial search over subsets of predictors and quickly becomes computati...




Publication date: 2007